ABSTRACT

Not infrequently, investigators assume that 
reliability for groups is greater than reliability for persons, or 
that the error variance for groups is less than that for persons. 
Using generalizability theory, it is shown that this "conventional 
wisdom" is not necessarily true. Examples are provided from the
course-evaluation and the performance-testing literature. In the 
cases considered in this paper, the conventional wisdom necessarily 
holds only for comparative statements about person versus group error 
variance when the universe of generalization has persons fixed and 
items random. In all other cases, the conventional wisdom may be 
false, in particular when the generalization is over both samples of 
persons and samples of items, which often represents the most 
sensible universe of generalization. An appendix elaborates on 
reliability. (Contains 8 references.) (Author/SLD)



ACT Research Report Series 



93-10 



Some Measurement 
Characteristics of Aggregated 
Versus Individual Scores 



Robert L. Brennan






December 1993 







For additional copies write: 
ACT Research Report Series 
P.O. Box 168 
Iowa City, Iowa 52243 



© 1993 by The American College Testing Program. All rights reserved.



Robert L. Brennan
American College Testing



Abstract 

Not infrequently, investigators assume that reliability for groups is greater than 
reliability for persons, and/or error variance for groups is less than error variance for
persons. Using generalizability theory, it is shown that this "conventional wisdom" is not 
necessarily true. Examples are provided from the course evaluation literature and the 
performance testing literature. 

It is often stated that if a test is not reliable enough for making decisions about 
individuals, or if error variance for individuals is unacceptably large, then the test should 
be used only for making decisions about groups. Implicit in such statements is an 
assumption that reliability for groups is necessarily larger than reliability for persons, and 
error variance for groups is necessarily smaller than error variance for persons. In this 
paper, such statements or assumptions will be called the "conventional wisdom." The
purpose of this paper is to show that this conventional wisdom is not necessarily true, and 
to identify specific conditions that lead to contradictions of this conventional wisdom. 

These issues are considered in the context of generalizability theory (Cronbach, 
Gleser, Nanda, & Rajaratnam, 1972), without a full explication of all the details of the 
theory. Readers unfamiliar with generalizability theory can consult Cronbach et al.
(1972), Brennan (1992a), or Shavelson and Webb (1991). Also, many aspects of
reliability (or generalizability) of group means have been treated by Kane and Brennan
(1977). A brief introduction to generalizability theory is provided by Brennan (1992b).

Generalizability Coefficients for Groups 
with Two Random Facets 

When persons (p) are nested within groups (g) and crossed with items (i), the
design is denoted (p:g) × i, and the linear model is

$$X_{gpi} = \mu + \nu_g + \nu_{p:g} + \nu_i + \nu_{gi} + \nu_{pi:g} . \tag{1}$$

The terms to the right of the equality (except $\mu$) are uncorrelated score effects with
expectations of zero, and $\nu_{pi:g}$ is an interaction effect confounded with other
sources of error. The variances of these score effects are called variance components.




For this design, if groups are the objects of measurement, then the universe of
generalization consists of the p and i facets. If, in addition, p and i are both random, then
the generalizability coefficient for generalizing over samples of k items and n persons
within each group is

$$E\rho_g^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_{p:g}^2/n + \sigma_{gI}^2 + \sigma_{pI:g}^2/n} , \tag{2}$$

where $\sigma_{gI}^2 = \sigma_{gi}^2/k$ and $\sigma_{pI:g}^2 = \sigma_{pi:g}^2/k$.

It is important to note that n in Equation 2 is the number of persons within a group, not 
the larger number of persons across all groups. 

Assuming that p and / are both random implies that replications of the 
measurement procedure would involve different sets of persons and items or, 
equivalently, that an investigator wants to generalize to a larger set of persons and items 
than those in a particular measurement procedure. This assumption seems especially 
sensible for programs such as NAEP, which uses a type of matrix sampling design.

From the perspective of generalizability theory, traditional measurement error is 
most closely associated with generalizing over samples of items. It is important to note, 
however, that traditional measurement error is not necessarily the only, or even the most 
important, source of unreliability for inferences about group means. As noted by Feldt
and Brennan (1989): 

The test results for any given year reflect not only the character of the
instructional program but also the character of students enrolled at that
specific moment. These individuals must be regarded as a sample, in a
longitudinal sense, from the population that flows through the district
schools over a period of years. A curricular judgment can be in error if a
particular year's class happens to be unusually strong or weak. Thus, even
if authorities were privileged to know the true scores of current students,
there could be substantial sampling error in using the results of one class
to infer something about the impact of a program. An estimate of the
reliability of class means must take this into account. (p. 127)

In short, in most circumstances, generalizing over both items and persons seems sensible in
examining the reliability of group means.

Using the notational system introduced above, when persons within a single
randomly selected group are the objects of measurement, the generalizability coefficient
is

$$E\rho_{p:g}^2 = \frac{\sigma_{p:g}^2}{\sigma_{p:g}^2 + \sigma_{pI:g}^2} . \tag{3}$$

It is important to note that Equation 3 is for persons within a group, not across groups.
The generalizability coefficient for all persons across groups is

$$E\rho_p^2 = \frac{\sigma_g^2 + \sigma_{p:g}^2}{\sigma_g^2 + \sigma_{p:g}^2 + \sigma_{gI}^2 + \sigma_{pI:g}^2} . \tag{4}$$

Note that for both of the coefficients in Equations 3 and 4, persons are the objects of
measurement and the universe of generalization involves the items facet only.

Usually, when comparative statements are made about reliability coefficients for 
groups and persons, the intended interpretation of reliability for persons is given by 
Equation 4. Therefore, a central focus of this paper is to compare Equation 2 and 
Equation 4. In particular, it is of interest to identify conditions under which
$E\rho_g^2 < E\rho_p^2$. One such condition is $\sigma_g^2 = 0$, which is an unlikely occurrence
implying that all group means are equal.

The inequality $E\rho_g^2 < E\rho_p^2$ is also true when $k \to \infty$ because in that case
$E\rho_p^2 = 1$ and $E\rho_g^2 = \sigma_g^2/(\sigma_g^2 + \sigma_{p:g}^2/n) < 1$. Consequently, it seems
likely that long tests that are highly reliable for decisions about persons will be less
reliable for decisions about groups. For example, the following estimated variance
components were obtained from an administration of the ACT Assessment Mathematics test
in schools (i.e., groups) in a particular state:

$$\hat\sigma_g^2 = .0016, \quad \hat\sigma_{p:g}^2 = .0329, \quad \hat\sigma_{gi}^2 = .0009, \quad \text{and} \quad \hat\sigma_{pi:g}^2 = .1809 .$$

The ACT Mathematics test contains k = 60 multiple choice items, and the average
number of students per school was about n = 145. For these values

$$E\hat\rho_g^2 = .86 < E\hat\rho_p^2 = .92 .$$

Even with such a large value for n, $E\rho_g^2$ is still less than $E\rho_p^2$, in large part because the
ACT Mathematics test is very reliable for person-level decisions.
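
The ACT comparison can be checked numerically. The following is a minimal sketch, assuming Equations 2 and 4 have the forms implied by the definitions of the variance components; the function and argument names are illustrative choices, not notation from the report.

```python
def g_coef_group(var_g, var_pg, var_gi, var_pig, n, k):
    """Equation 2: groups are the objects; persons and items are both random."""
    rel_error = var_pg / n + var_gi / k + var_pig / (n * k)
    return var_g / (var_g + rel_error)


def g_coef_person(var_g, var_pg, var_gi, var_pig, k):
    """Equation 4: persons across groups are the objects; items are random."""
    universe = var_g + var_pg
    rel_error = (var_gi + var_pig) / k
    return universe / (universe + rel_error)


# Estimated variance components reported in the text for ACT Mathematics.
act = dict(var_g=0.0016, var_pg=0.0329, var_gi=0.0009, var_pig=0.1809)

rho_g = g_coef_group(**act, n=145, k=60)
rho_p = g_coef_person(**act, k=60)
print(f"group: {rho_g:.2f}  person: {rho_p:.2f}")  # group: 0.86  person: 0.92
```

Changing `n` or `k` in the two calls shows how the comparison shifts with group size and test length.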

Equations 2 and 4 also imply that $E\rho_g^2 < E\rho_p^2$ is true when n = 1, which suggests
that $E\rho_g^2 < E\rho_p^2$ is more likely to be true for small values of n than for large values.
However, in general, for $1 < n < \infty$ and $1 < k < \infty$ there appears to be no simple, necessary
relationship among the variance components that guarantees that $E\rho_g^2 < E\rho_p^2$. Even so,
there are sufficient conditions that do pertain. One such condition is discussed next.

A Sufficient Condition

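Note that, for a given value of k, the maximum value of Equation 2 occurs when
$n \to \infty$, in which case

$$(E\rho_g^2 \mid n \to \infty) = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_{gI}^2} . \tag{5}$$

Clearly, if $(E\rho_g^2 \mid n \to \infty) < E\rho_p^2$ then $E\rho_g^2 < E\rho_p^2$. Therefore, whenever Equation 5 is
smaller than Equation 4, $E\rho_g^2 < E\rho_p^2$. Consequently, we will focus on Equations 4 and
5 to obtain a sufficient condition for $E\rho_g^2$ to be smaller than $E\rho_p^2$. Letting

$$K = \sigma_g^2/\sigma_{p:g}^2 \quad \text{and} \tag{6}$$

$$L = \sigma_{gI}^2/\sigma_{pI:g}^2 , \tag{7}$$

Equation 4 can be written

$$E\rho_p^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_{gI}^2 \left(\dfrac{L+1}{L}\right)\left(\dfrac{K}{K+1}\right)} . \tag{8}$$

Equation 5 is smaller than Equation 8 when

$$\left(\frac{L+1}{L}\right)\left(\frac{K}{K+1}\right) < 1 ,$$

which implies that (KL + K) < (KL + L). This inequality is true whenever K < L.

It follows that $E\rho_g^2 < E\rho_p^2$ whenever

$$\frac{\sigma_g^2}{\sigma_{p:g}^2} < \frac{\sigma_{gI}^2}{\sigma_{pI:g}^2} , \tag{9}$$

or equivalently whenever

$$\frac{\sigma_g^2}{\sigma_g^2 + \sigma_{p:g}^2} < \frac{\sigma_{gI}^2}{\sigma_{gI}^2 + \sigma_{pI:g}^2} . \tag{10}$$
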
Loosely speaking, these results mean that reliability for groups is less than reliability for
persons whenever the proportion of persons' universe score variance attributable to
groups is less than the proportion of persons' error variance attributable to groups. (This
statement is "loose" primarily because it does not explicitly specify that generalization is
over both persons and items from the infinite universe of generalization for both facets.)
The condition in Equation 9 or Equation 10 might hold, for example, if schools (i.e.,
groups) have highly similar universe scores, but at a particular time students in different
schools have been exposed to different subsets of the tested topics.

Since $\sigma_{gI}^2 = \sigma_{gi}^2/k$ and $\sigma_{pI:g}^2 = \sigma_{pi:g}^2/k$, the right side of Inequality 10 is
invariant over k. Consequently, this inequality is equivalent to

$$\frac{\sigma_g^2}{\sigma_g^2 + \sigma_{p:g}^2} < \frac{\sigma_{gi}^2}{\sigma_{gi}^2 + \sigma_{pi:g}^2} , \tag{11}$$

which is sometimes more convenient to use. In short, $E\rho_g^2 < E\rho_p^2$ if Inequality 9, 10, or
11 holds. As noted previously, this is a sufficient condition for $E\rho_g^2 < E\rho_p^2$; it is by no
means a necessary condition. That is, $E\rho_g^2$ can be smaller than $E\rho_p^2$ even if Inequality
9, 10, or 11 does not hold.
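
That the condition is sufficient but not necessary can be seen with the ACT Mathematics components quoted earlier in the text. The sketch below assumes the single-item form of the condition (Inequality 11) and the coefficient formulas implied by the variance-component definitions; all function names are illustrative.

```python
def inequality_11_holds(var_g, var_pg, var_gi, var_pig):
    """True when the group share of universe-score variance is smaller than
    the group share of error variance (the sufficient condition)."""
    share_universe = var_g / (var_g + var_pg)
    share_error = var_gi / (var_gi + var_pig)
    return share_universe < share_error


def g_coef_group(var_g, var_pg, var_gi, var_pig, n, k):
    """Group coefficient with persons and items both random."""
    return var_g / (var_g + var_pg / n + var_gi / k + var_pig / (n * k))


def g_coef_person(var_g, var_pg, var_gi, var_pig, k):
    """Person coefficient with items random."""
    universe = var_g + var_pg
    return universe / (universe + (var_gi + var_pig) / k)


# ACT Mathematics variance components reported in the text.
act = dict(var_g=0.0016, var_pg=0.0329, var_gi=0.0009, var_pig=0.1809)

condition = inequality_11_holds(**act)
reversal = g_coef_group(**act, n=145, k=60) < g_coef_person(**act, k=60)
print(condition, reversal)  # False True
```

For these data the sufficient condition fails, yet group reliability is still below person reliability at n = 145 and k = 60.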
Two Examples 

Discussed next are two examples that illustrate circumstances under which
$E\rho_g^2 < E\rho_p^2$. The first example is from the course evaluation literature. It illustrates a
circumstance under which the sufficient condition in Inequality 9, 10, or 11 is satisfied.
The second example is from the performance testing literature.

Example 1 . Kane, Gillmore, and Crooks (1976) studied the generalizability of 
class (i.e., group) means in the context of student evaluations of teaching. One of the 
questionnaires they studied was administered in all courses taught in the Physics 
Department at the University of Illinois, Urbana-Champaign. "Fifteen classes that had
twenty or more students were randomly selected, with the restriction that only one section 
taught by each instructor was included in the sample (Kane et al., 1976, p. 177)." Thus, 
there is a linking of each class with a unique instructor, and generalizations about class 
means are effectively generalizations about instructors. 

The questionnaire contained a set of k = 8 "attribute" items (e.g., ability to answer
questions) that were analyzed separately from other items. For these items. 



Using Equation 1 



^ _ _ 1S0< —Si = 15^ 

Therefore, these data satisfy a sufficient condition for $E\rho_g^2 < E\rho_p^2$, which means that
$E\rho_g^2 < E\rho_p^2$ for all pairs of values for n and k. For example, if n = 20 and k = 8,

$$E\hat\rho_g^2 = .65 < E\hat\rho_p^2 = .83 .$$




Example 2 . Shavelson, Baxter, and Gao (1993) provide an extensive discussion 
of the sampling variability of performance assessments in the context of several data sets. 
One such data set is from the California Assessment Program (CAP) which conducted a 
voluntary statewide science assessment in 1989-1990 with approximately 600 schools. 
Students took five performance tasks involving identifying materials that serve as 
conductors, classifying leaves, identifying unknown rocks, estimating and measuring 
characteristics of water, and discovering reasons why fish were dying. The design 
employed by Shavelson et al. (1993) is more complicated than the (p:g) × i design
considered in this paper, but a subset of their results gives

$$\hat\sigma_g^2 = .07, \quad \hat\sigma_{p:g}^2 = .23, \quad \hat\sigma_{gi}^2 = .07, \quad \text{and} \quad \hat\sigma_{pi:g}^2 = .43 .$$

In this case, $\hat\sigma_g^2/(\hat\sigma_g^2 + \hat\sigma_{p:g}^2) = .23$, $\hat\sigma_{gi}^2/(\hat\sigma_{gi}^2 + \hat\sigma_{pi:g}^2) = .14$, and
Inequality 11 is not satisfied. However, there are numerous combinations of values of n
and k for which $E\rho_g^2 < E\rho_p^2$. For example, when n = 20 and k = 5

$$E\hat\rho_g^2 = .70 < E\hat\rho_p^2 = .75 .$$

Recalling that the CAP involved k = 5 performance tasks, it can be shown that
$E\hat\rho_g^2 < E\hat\rho_p^2$ whenever $n \le 33$. Furthermore, for any value of k, $E\hat\rho_g^2 < E\hat\rho_p^2$ when
$n \le 14$. In other words, no matter how many performance tasks are employed in the
CAP, the conventional wisdom about group reliability being larger than person reliability
will be incorrect if the number of persons within schools is less than 15.
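
The group-size thresholds for the CAP example can be recovered by direct search. This is a rough sketch, assuming the coefficient formulas implied by the variance-component definitions; the helper `largest_n_with_reversal` and its search limits are hypothetical conveniences.

```python
def g_coef_group(var_g, var_pg, var_gi, var_pig, n, k):
    """Group coefficient with persons and items both random."""
    return var_g / (var_g + var_pg / n + var_gi / k + var_pig / (n * k))


def g_coef_person(var_g, var_pg, var_gi, var_pig, k):
    """Person coefficient with items random."""
    universe = var_g + var_pg
    return universe / (universe + (var_gi + var_pig) / k)


def largest_n_with_reversal(vc, k, n_max=1000):
    """Largest n in 1..n_max for which the group coefficient is below the
    person coefficient (0 if there is no such n)."""
    rho_p = g_coef_person(**vc, k=k)
    best = 0
    for n in range(1, n_max + 1):
        if g_coef_group(**vc, n=n, k=k) < rho_p:
            best = n
    return best


# CAP variance components quoted in the text from Shavelson et al. (1993).
cap = dict(var_g=0.07, var_pg=0.23, var_gi=0.07, var_pig=0.43)

at_k5 = largest_n_with_reversal(cap, k=5)
over_all_k = min(largest_n_with_reversal(cap, k=k) for k in range(1, 51))
print(at_k5, over_all_k)  # 33 14
```

The search over k stops at 50 tasks only for convenience; for these components the threshold grows with k, so the smallest value occurs at k = 1.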

These examples illustrate that $E\rho_g^2 < E\rho_p^2$ is not simply a mathematical
possibility. It is a likely occurrence in numerous circumstances, especially when
variability of persons within groups, $\sigma_{p:g}^2$, is relatively large.

Error Variances for Groups With Two Random Facets 
Just as the conventional wisdom suggests that reliability for groups is greater than
reliability for persons, so too investigators often implicitly or explicitly assume that error
variance for groups is less than error variance for persons. However, as discussed next,
this conventional wisdom about error variances is not necessarily true, either.

The error variance associated with $E\rho_g^2$ in Equation 2 is the relative error variance
for group mean scores when generalization is over both persons and items:

$$\sigma^2(\delta_g) = \sigma_{p:g}^2/n + \sigma_{gI}^2 + \sigma_{pI:g}^2/n . \tag{12}$$

For $E\rho_p^2$ in Equation 4 the relative error variance for person mean scores when
generalization is over items is

$$\sigma^2(\delta_p) = \sigma_{gI}^2 + \sigma_{pI:g}^2 . \tag{13}$$

We wish to determine conditions under which $\sigma^2(\delta_g) > \sigma^2(\delta_p)$.
From Equations 12 and 13, $\sigma^2(\delta_g) > \sigma^2(\delta_p)$ whenever

$$\sigma_{p:g}^2/n + \sigma_{pI:g}^2/n > \sigma_{pI:g}^2 .$$

Multiplying both sides by n and collecting terms gives

$$\sigma_{p:g}^2 > (n - 1)\,\sigma_{pI:g}^2 .$$

Dividing both sides by $\sigma_{pI:g}^2$ gives

$$\frac{\sigma_{p:g}^2}{\sigma_{pI:g}^2} > n - 1 . \tag{14}$$

The left side of the above inequality is the signal/noise ratio associated with the
generalizability coefficient in Equation 3 for the reliability of persons within a randomly
selected group, $E\rho_{p:g}^2$. Since a generalizability coefficient can be viewed as the ratio of
signal (i.e., universe score variance) to signal plus noise (i.e., relative error variance),
Inequality 14 is equivalent to

$$\frac{E\rho_{p:g}^2}{1 - E\rho_{p:g}^2} > n - 1 .$$

Therefore, a necessary condition for $\sigma^2(\delta_g) > \sigma^2(\delta_p)$ is

$$E\rho_{p:g}^2 > \frac{n - 1}{n} . \tag{15}$$

[These results for relative error variance, $\sigma^2(\delta)$, also hold for absolute error variance,
$\sigma^2(\Delta)$, because for both groups and persons $\sigma^2(\Delta)$ differs from $\sigma^2(\delta)$ by the same
constant, $\sigma_i^2/k$; i.e., $\sigma^2(\Delta_g) - \sigma^2(\delta_g) = \sigma^2(\Delta_p) - \sigma^2(\delta_p) = \sigma_i^2/k$.]

Clearly, when $n \to \infty$ Inequality 15 will not hold. So, for large values of n, it is
reasonable to assume that error variance for persons is likely to be larger than error
variance for groups. However, for small values of n, this need not be so. For example,
using Inequality 15, $\sigma^2(\delta_g) > \sigma^2(\delta_p)$ in the following cases:
n < 20 and $E\rho_{p:g}^2 = .95$, and
n < 10 and $E\rho_{p:g}^2 = .90$.
These cases are not so extreme as to be entirely implausible, especially for long tests.
Consequently, it is unwise to assume that error variance for person mean scores is always
greater than error variance for group mean scores.
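
The two cases above follow from rearranging the condition that the within-group person coefficient exceed (n - 1)/n, which gives n < 1/(1 - coefficient). A small sketch, using exact fractions to avoid rounding at the boundary; the function name is an illustrative assumption.

```python
from fractions import Fraction


def max_n_with_larger_group_error(rho_within):
    """Largest n satisfying rho_within > (n - 1)/n, for 0 < rho_within < 1."""
    n = 1  # n = 1 always qualifies because (1 - 1)/1 = 0
    # n/(n + 1) is the bound (m - 1)/m evaluated at m = n + 1.
    while Fraction(n, n + 1) < rho_within:
        n += 1
    return n


print(max_n_with_larger_group_error(Fraction("0.95")))  # 19, i.e. n < 20
print(max_n_with_larger_group_error(Fraction("0.90")))  # 9, i.e. n < 10
```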

Note that, if $\sigma^2(\delta_g) > \sigma^2(\delta_p)$ it necessarily follows that $E\rho_g^2 < E\rho_p^2$. To see this,
recall from Equations 2 and 4 that universe score variance for groups cannot be larger
than universe score variance for persons. This guarantees that $E\rho_g^2 < E\rho_p^2$ when
$\sigma^2(\delta_g) > \sigma^2(\delta_p)$. In other words, Inequality 15 is another sufficient condition for
$E\rho_g^2 < E\rho_p^2$.

By contrast, if $E\rho_g^2 < E\rho_p^2$ it does not necessarily follow that $\sigma^2(\delta_g) > \sigma^2(\delta_p)$.
This is true for the Shavelson et al. (1993) example with n = 20 and k = 5. In this case, it
has already been shown that $E\hat\rho_g^2 < E\hat\rho_p^2$, and $\hat\sigma^2(\delta_g) = .03 < \hat\sigma^2(\delta_p) = .10$. Therefore, it
is possible for group error variance to be smaller than person error variance (in accord
with the conventional wisdom), while at the same time group reliability is less than
person reliability (against the conventional wisdom).
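
The two error variances in this example can be reproduced from the CAP variance components quoted earlier in the text. The sketch assumes the relative error variances are the error terms of the group and person coefficients; the function names are illustrative.

```python
def rel_error_group(var_pg, var_gi, var_pig, n, k):
    """Relative error variance for group means; persons and items random."""
    return var_pg / n + var_gi / k + var_pig / (n * k)


def rel_error_person(var_gi, var_pig, k):
    """Relative error variance for person means; items random."""
    return (var_gi + var_pig) / k


# CAP components from the text: groups .07, persons within groups .23,
# group-by-item .07, person-by-item within groups .43.
var_g, var_pg, var_gi, var_pig = 0.07, 0.23, 0.07, 0.43
n, k = 20, 5

err_g = rel_error_group(var_pg, var_gi, var_pig, n, k)
err_p = rel_error_person(var_gi, var_pig, k)
rho_g = var_g / (var_g + err_g)
rho_p = (var_g + var_pg) / (var_g + var_pg + err_p)

print(f"error: group {err_g:.2f} < person {err_p:.2f}")
print(f"coefficient: group {rho_g:.2f} < person {rho_p:.2f}")
```

Both orderings hold at once: the group error variance is the smaller of the two, and so is the group coefficient.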
One Random Facet

To this point it has been assumed that, when group means are the objects of
measurement, both persons and items are randomly sampled from an infinite universe of
generalization. Call this the "unrestricted" universe. It has been argued in the
introduction to the previous section, that in most circumstances this unrestricted universe
is sensible for examining the measurement characteristics of aggregated scores. However,
there may be circumstances in which it is reasonable to consider a restricted universe of 
generalization in which persons are fixed and items are random. If so, then replications 
of the measurement procedure would involve the same persons but different sets of items. 
For this restricted universe, generalizability coefficients will be larger, and error 
variances will be smaller, than for the unrestricted universe. Consequently, it is more 
likely that for this restricted universe the conventional wisdom about group reliability and 
error variance holds. 

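When persons are fixed and items are random, the generalizability coefficient for
groups is

$$(E\rho_g^2 \mid P) = \frac{\sigma_g^2 + \sigma_{p:g}^2/n}{\sigma_g^2 + \sigma_{p:g}^2/n + \sigma_{gI}^2 + \sigma_{pI:g}^2/n} , \tag{16}$$

and the associated relative error variance for group means is

$$\sigma^2(\delta_g \mid P) = \sigma_{gI}^2 + \sigma_{pI:g}^2/n . \tag{17}$$

In effect, fixing persons causes $\sigma_{p:g}^2/n$ to move from relative error variance (see Equation
12) to universe score variance.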

Comparing Equation 17 with Equation 13 shows that when generalization is over
items only, $\sigma^2(\delta_g \mid P) < \sigma^2(\delta_p)$ as long as n > 1. That is, the conventional wisdom about
error variances holds. However, the conventional wisdom about reliability coefficients
does not necessarily hold. In particular, as shown in the appendix, $(E\rho_g^2 \mid P) < E\rho_p^2$ when
Inequality 9, 10, or 11 holds. That is, if Inequality 9, 10, or 11 holds, then it necessarily
follows that $(E\rho_g^2 \mid P) < E\rho_p^2$ in the restricted universe. By contrast, Inequality 9, 10, or
11 is only a sufficient condition for $E\rho_g^2 < E\rho_p^2$ in the unrestricted universe.
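
To see how fixing persons changes the picture, the sketch below evaluates the restricted-universe quantities for the CAP components quoted earlier (n = 20, k = 5). It assumes that fixing persons moves the persons-within-groups term from error to universe score variance, as described above; the function names are illustrative.

```python
def g_coef_group_fixed_persons(var_g, var_pg, var_gi, var_pig, n, k):
    """Group coefficient with persons fixed and items random."""
    universe = var_g + var_pg / n
    return universe / (universe + var_gi / k + var_pig / (n * k))


def rel_error_group_fixed_persons(var_gi, var_pig, n, k):
    """Relative error variance for group means with persons fixed."""
    return var_gi / k + var_pig / (n * k)


# CAP components quoted in the text from Shavelson et al. (1993).
var_g, var_pg, var_gi, var_pig = 0.07, 0.23, 0.07, 0.43
n, k = 20, 5

rho_g_fixed = g_coef_group_fixed_persons(var_g, var_pg, var_gi, var_pig, n, k)
err_g_fixed = rel_error_group_fixed_persons(var_gi, var_pig, n, k)
err_p = (var_gi + var_pig) / k
rho_p = (var_g + var_pg) / (var_g + var_pg + err_p)

print(f"coefficients: group|P {rho_g_fixed:.2f}, person {rho_p:.2f}")
print(f"error variances: group|P {err_g_fixed:.3f}, person {err_p:.3f}")
```

For these components Inequality 11 does not hold, and in the restricted universe the group coefficient exceeds the person coefficient, in line with the appendix result.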

It is also possible to consider $E\rho_g^2 \mid I$ and $\sigma^2(\delta_g \mid I)$ for the case when items are
fixed and persons are random. If so, then replications of the measurement procedure
would involve the same items but different sets of persons. This possibility is considered
by Feldt and Brennan (1989, pp. 127, 135-136). It can be shown that there are conditions
such that $(E\rho_g^2 \mid I) < E\rho_p^2$ and $\sigma^2(\delta_g \mid I) > \sigma^2(\delta_p)$. However, there is a conceptual conflict
in comparing the magnitude of $E\rho_p^2$ with $E\rho_g^2 \mid I$, and the magnitude of $\sigma^2(\delta_p)$ with
$\sigma^2(\delta_g \mid I)$. The conflict arises because items are fixed for $E\rho_g^2 \mid I$ and $\sigma^2(\delta_g \mid I)$, whereas
items are random for $E\rho_p^2$ and $\sigma^2(\delta_p)$. Therefore, although statements can be made about
the relative magnitudes of these quantities, such comparisons are likely to be misleading.

Summary and Discussion 
It has been shown that when persons and items are random the conventional
wisdom that $E\rho_g^2 > E\rho_p^2$ and $\sigma^2(\delta_g) < \sigma^2(\delta_p)$ does not necessarily hold. In particular,
$\sigma_g^2/(\sigma_g^2 + \sigma_{p:g}^2) < \sigma_{gi}^2/(\sigma_{gi}^2 + \sigma_{pi:g}^2)$ is a sufficient condition for $E\rho_g^2 < E\rho_p^2$. Even if this
sufficient condition is not met, there can be various combinations of values for n and k
such that $E\rho_g^2 < E\rho_p^2$. Also, contrary to the conventional wisdom, $\sigma^2(\delta_g) > \sigma^2(\delta_p)$
whenever $E\rho_{p:g}^2 > (n - 1)/n$, and for small values of n and long tests this condition might
well be met.

When persons are fixed and items are random, the conventional wisdom that
$\sigma^2(\delta_g \mid P) < \sigma^2(\delta_p)$ is true provided n > 1. However, for this restricted universe of
generalization, the conventional wisdom that $(E\rho_g^2 \mid P) > E\rho_p^2$ is false if
$\sigma_g^2/(\sigma_g^2 + \sigma_{p:g}^2) < \sigma_{gi}^2/(\sigma_{gi}^2 + \sigma_{pi:g}^2)$.

In short, for the cases considered in this paper, the conventional wisdom
necessarily holds only for comparative statements about person vs. group error variance
when the universe of generalization has persons fixed and items random. In all other
cases, the conventional wisdom about reliability coefficients and error variances may be
false. In particular, the conventional wisdom may be false when generalization is over
both samples of persons and samples of items, which often represents the most sensible
universe of generalization. For this universe, the form of Equations 2, 4, 12, and 13
clearly shows that $\sigma_{p:g}^2$ is incorporated in universe score variance when persons are the
objects of measurement, whereas $\sigma_{p:g}^2$ is incorporated in error variance when groups are
the objects of measurement. Therefore, the magnitude of $\sigma_{p:g}^2$ is likely to be very
influential in whether or not the conventional wisdom holds.

As illustrated by the examples in this paper, violations of the conventional 
wisdom about group means are not merely mathematical possibilities — such violations 
are quite common, although they are seldom reported. 

Some of the results presented in this paper may seem to contradict the central 
limit theorem. In its simplest form, the central limit theorem implies that error variance 
for mean scores will be less than error variance for individual scores. This has been 
shown to be true for a universe of generalization in which items constitute the only
random facet, but not necessarily true for a universe of generalization in which both 
persons and items are random. One of the strengths of generalizability theory is that it 
permits an investigator to disentangle the amount of error attributable to multiple facets. 
This cannot be done (or at least not directly) using the simple form of the central limit 
theorem. 

Sometimes investigators appear to assume that statements about the relative 
magnitudes of reliability coefficients are interchangeable with statements about the 
relative magnitudes of the corresponding error variances. As discussed previously, this is 
not necessarily true. It is possible that $E\rho_g^2 < E\rho_p^2$ (against the conventional wisdom)
while $\sigma^2(\delta_g) < \sigma^2(\delta_p)$ (in accord with the conventional wisdom). It is important,
therefore, that investigators not generalize from statements about reliability to
statements about error variance, or vice-versa.

Throughout this paper, emphasis has been on discussing conditions under which
$E\rho_g^2 < E\rho_p^2$ and $\sigma^2(\delta_g) > \sigma^2(\delta_p)$, i.e., conditions under which the conventional
wisdom is reversed. Note, as well, that even if a reversal does not occur, aggregation to a
group level may have relatively little impact on reliability. For example, for the
Shavelson et al. (1993) example introduced earlier, if k = 5, there is no value of n such
that $E\rho_g^2$ is greater than $E\rho_p^2$ by .10 or more, and n > 90 is required for $E\rho_g^2$ to be greater
than $E\rho_p^2$ by .05. Furthermore, especially for relatively small values of k, $E\rho_g^2$ may be
unacceptably small even if it is larger than $E\rho_p^2$. This is a distinct possibility in some
performance testing contexts.
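
The limited payoff from aggregation in this example can be checked directly. A minimal sketch, assuming the coefficient formulas implied by the variance-component definitions and using the CAP components quoted earlier; the function and variable names are illustrative.

```python
def g_coef_group(var_g, var_pg, var_gi, var_pig, n, k):
    """Group coefficient with persons and items both random."""
    return var_g / (var_g + var_pg / n + var_gi / k + var_pig / (n * k))


def g_coef_person(var_g, var_pg, var_gi, var_pig, k):
    """Person coefficient with items random."""
    universe = var_g + var_pg
    return universe / (universe + (var_gi + var_pig) / k)


cap = dict(var_g=0.07, var_pg=0.23, var_gi=0.07, var_pig=0.43)
k = 5
rho_p = g_coef_person(**cap, k=k)

# Ceiling on the group coefficient as n grows without bound.
ceiling = cap["var_g"] / (cap["var_g"] + cap["var_gi"] / k)
max_gain = ceiling - rho_p

# Smallest group size at which the group coefficient leads by at least .05.
n = 1
while g_coef_group(**cap, n=n, k=k) - rho_p < 0.05:
    n += 1

print(f"largest possible gain: {max_gain:.3f}; first n with gain >= .05: {n}")
```

With five tasks the gain from aggregation can never reach .10, and a gain of .05 first appears at n = 91.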

Many writers, including this author, have argued that too frequently reliability 
coefficients are referenced in contexts where error variances would be more appropriate.
Also, in item response theory, there is little attention given to reliability coefficients. 
These two points may seem to suggest that the issues raised in this paper about reliability 
coefficients do not deserve much attention. Such a conclusion would be unfortunate. 
Reliability coefficients (as well as signal/noise ratios) have the distinct advantage of 
providing, in one statistic, a comparison between true (or universe) score variance and 
error variance, whereas examining error variance in isolation often leaves an investigator 
pondering whether or not error variance is to be considered large or small. This is a 
particularly important consideration when examining group means. Aggregation may 
well lead to a sizable decrease in error variance, but this can be very misleading if an 
investigator fails to take into account the corresponding decrease in true (or universe) 
score variance. In short, both reliability coefficients and error variances have utility for
examining the measurement characteristics of aggregated scores versus individual scores. 
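Appendix

A Condition Under Which $(E\rho_g^2 \mid P) < E\rho_p^2$

Let $K = \sigma_g^2/\sigma_{p:g}^2$ and $L = \sigma_{gI}^2/\sigma_{pI:g}^2$ as in Equations 6 and 7, respectively. It follows
from Equation 8 that

$$E\rho_p^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_{gI}^2 \left(\dfrac{L+1}{L}\right)\left(\dfrac{K}{K+1}\right)} . \tag{A1}$$

From Equation 16

$$(E\rho_g^2 \mid P) = \frac{\sigma_g^2 + \sigma_{p:g}^2/n}{\sigma_g^2 + \sigma_{p:g}^2/n + \sigma_{gI}^2 + \sigma_{pI:g}^2/n}
= \frac{\sigma_g^2 + \sigma_g^2/Kn}{\sigma_g^2 + \sigma_g^2/Kn + \sigma_{gI}^2 + \sigma_{gI}^2/Ln}$$

$$= \frac{\sigma_g^2}{\sigma_g^2 + \sigma_{gI}^2 \left(\dfrac{Ln+1}{Ln}\right)\left(\dfrac{Kn}{Kn+1}\right)} . \tag{A2}$$

In comparing Equations A1 and A2, it is evident that $(E\rho_g^2 \mid P) < E\rho_p^2$ when

$$\frac{L+1}{K+1} < \frac{Ln+1}{Kn+1} ,$$

which is equivalent to

$$(L+1)(Kn+1) < (K+1)(Ln+1)$$
$$LKn + L + Kn + 1 < LKn + K + Ln + 1$$
$$Kn - K < Ln - L$$
$$K(n-1) < L(n-1)$$
$$K < L .$$

Therefore, $(E\rho_g^2 \mid P) < E\rho_p^2$ when

$$\frac{\sigma_g^2}{\sigma_{p:g}^2} < \frac{\sigma_{gI}^2}{\sigma_{pI:g}^2} ,$$

which is equivalent to

$$\frac{\sigma_g^2}{\sigma_g^2 + \sigma_{p:g}^2} < \frac{\sigma_{gi}^2}{\sigma_{gi}^2 + \sigma_{pi:g}^2} ,$$

as shown in the text leading to Equation 11.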
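
The algebra above can be spot-checked numerically. The sketch below draws variance components at random (made-up values used purely for testing, not data from any study) and confirms, with exact rational arithmetic, that the restricted group coefficient is below the person coefficient exactly when K < L. The function names, ranges, and seed are arbitrary assumptions.

```python
import random
from fractions import Fraction


def restricted_group_coef(vg, vpg, vgi, vpig, n, k):
    """Group coefficient with persons fixed and items random."""
    universe = vg + vpg / n
    return universe / (universe + vgi / k + vpig / (n * k))


def person_coef(vg, vpg, vgi, vpig, k):
    """Person coefficient with items random."""
    universe = vg + vpg
    return universe / (universe + (vgi + vpig) / k)


def appendix_equivalence_holds(trials=2000, seed=0):
    """Check (restricted group coef < person coef) <=> (K < L) for n > 1."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Arbitrary positive variance components in hundredths.
        vg, vpg, vgi, vpig = (Fraction(rng.randint(1, 100), 100) for _ in range(4))
        n = rng.randint(2, 200)
        k = rng.randint(1, 100)
        K = vg / vpg
        L = vgi / vpig  # the item count k cancels in this ratio
        below = restricted_group_coef(vg, vpg, vgi, vpig, n, k) < person_coef(vg, vpg, vgi, vpig, k)
        if below != (K < L):
            return False
    return True


print(appendix_equivalence_holds())  # True
```

Exact fractions are used so that near-ties between K and L cannot be misjudged by floating-point rounding.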



